command (make sure that the SRA Toolkit is installed on your computer and that its
programs are on your PATH):
mkdir fastqs
cd fastqs
mkdir single
cd single
fasterq-dump --verbose SRR030834
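Before running the download, it can help to confirm that the tool is actually reachable from the shell. The following is a minimal sketch; the helper name "check_tool" is hypothetical and is not part of the SRA Toolkit.

```shell
# Hypothetical helper: report whether a command can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}

# Check for the download tool before calling fasterq-dump.
check_tool fasterq-dump || echo "Install the SRA Toolkit or add its bin directory to PATH."
```

If the tool is reported as missing, the fasterq-dump command above will fail with a "command not found" error until the PATH is corrected.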
As shown in Figure 1.7, the FASTQ file “SRR030834.fastq” has been downloaded to the
current directory. We will use this file to demonstrate some Linux commands for working
with FASTQ files.
FASTQ files may contain millions of entries, and their sizes can range from several
megabytes to many gigabytes, which often makes them too large to open in a normal text
editor. In general, there is no need to open a FASTQ file except for troubleshooting or out
of curiosity. To view a large FASTQ file, we can use Unix or Linux commands such as “less”
or “more”, which display a very large text file page by page, or “cat”, which prints the
entire content of the file to the terminal.
less SRR030834.fastq
more SRR030834.fastq
cat SRR030834.fastq
If a FASTQ file name ends with the “.gz” extension, the file is compressed with the
“gzip” program. In this case, use the “zless”, “zmore”, and “zcat” commands instead of
“less”, “more”, and “cat”, respectively; they display the content without decompressing
the file on disk.
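As a small illustration of reading a compressed file without unpacking it, the sketch below builds a made-up two-record file named "demo.fastq" (not SRR030834 data) in a scratch directory and streams its decompressed text through a pipe. It uses "gzip -dc", which does the same job as "zcat"; this is a substitution made here because on some systems "zcat" expects files ending in ".Z".

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir"

# A made-up two-record FASTQ file (illustrative only).
printf '%s\n' \
  '@read1' 'ACGTACGT' '+' 'IIIIIIII' \
  '@read2' 'TTGGCCAA' '+' 'HHHHHHHH' > demo.fastq

# Compress it: this produces demo.fastq.gz and removes demo.fastq.
gzip demo.fastq

# Stream the decompressed text to standard output; no uncompressed file is written.
gzip -dc demo.fastq.gz | head -n 4    # show the first record only
```

The same pipe pattern applies to a real file, for example by replacing "demo.fastq.gz" with "SRR030834.fastq.gz" once that file has been compressed.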
We can also use “head” and “tail” to display the first and last lines of a file, respectively.
The following command will display the first 15 lines of the file:
head -15 SRR030834.fastq
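Because a FASTQ record normally spans four lines, asking “head” or “tail” for a multiple of four lines returns whole records. The sketch below assumes unwrapped, four-line records and uses a made-up file named "demo.fastq" rather than real sequencing data.

```shell
# Work in a scratch directory with a made-up three-record FASTQ file.
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' \
  '@read1' 'ACGTACGT' '+' 'IIIIIIII' \
  '@read2' 'TTGGCCAA' '+' 'HHHHHHHH' \
  '@read3' 'GGGGAAAA' '+' 'FFFFFFFF' > demo.fastq

head -n 4 demo.fastq    # first record (four lines)
tail -n 4 demo.fastq    # last record (four lines)

# Estimate the number of records: total line count divided by four.
# This holds only under the four-lines-per-record assumption.
lines=$(wc -l < demo.fastq)
echo "records: $((lines / 4))"
```

On a real file the same commands apply unchanged, with "SRR030834.fastq" in place of "demo.fastq".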
If a FASTQ file is large, we can compress it with the “gzip” program, which can reduce
its size to less than a third of the original. Compressing the “SRR030834.fastq” file with
gzip will reduce its size to less than one gigabyte.
gzip SRR030834.fastq
By default, gzip replaces the original file, so the file name will become “SRR030834.fastq.gz”.
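When the uncompressed file should be kept, one option is to write the compressed copy to a new file with "gzip -c" and then check it. The sketch below uses a made-up, highly repetitive file named "demo.fastq", so it compresses far better than real reads would; the sizes it prints say nothing about SRR030834.

```shell
# Work in a scratch directory with a made-up 1000-record FASTQ file.
workdir=$(mktemp -d)
cd "$workdir"
i=0
while [ "$i" -lt 1000 ]; do
    printf '@read%d\nACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIII\n' "$i"
    i=$((i + 1))
done > demo.fastq

# Write the compressed data to a new file; demo.fastq is left in place.
gzip -c demo.fastq > demo.fastq.gz

# Test the integrity of the compressed file.
gzip -t demo.fastq.gz && echo "archive OK"

# Compare the sizes in bytes.
orig_size=$(wc -c < demo.fastq)
gz_size=$(wc -c < demo.fastq.gz)
echo "original: $orig_size bytes, compressed: $gz_size bytes"
```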
FIGURE 1.7 Downloading a FASTQ file from the NCBI SRA database.